Abstract

The purpose of feature selection is to identify the relevant and non-redundant features from a dataset. In this article, the feature selection problem is formulated as a graph-theoretic problem in which a feature-dissimilarity graph is constructed from the data matrix. The nodes represent features and the edges represent their dissimilarity. Nodes are weighted according to feature relevance, and edges according to the dissimilarity between the features they connect. The problem of finding relevant and non-redundant features is then mapped to a densest-subgraph finding problem. We propose a multiobjective particle swarm optimization (PSO)-based algorithm that simultaneously optimizes the average node weight and the average edge weight of the candidate subgraph. The proposed algorithm is applied to identify relevant and non-redundant disease-related genes from microarray gene expression data. Its performance is compared with that of several existing feature selection techniques on different real-life microarray gene expression datasets.
Introduction

Data dimensionality can be reduced in two ways: 1) feature extraction, which creates new features by combining existing ones, and 2) feature selection, which chooses a subset of features by eliminating those with little or no predictive information. This study focuses only on feature selection. Feature selection has an immense impact on the quality of classification and clustering techniques in machine learning and pattern classification. It can be applied to both supervised and unsupervised learning. In a supervised scenario [1], [2], the correct class of every training sample is known, and the feature evaluation criteria used to generate the selected feature set are based on these known class labels. In contrast, in unsupervised cases the assessment criteria are completely independent of the true class labels. Performance in unsupervised classification is typically regarded as the capability of a clustering algorithm to expose groupings (clusters) in a given data set. Subsequently, the clustering solution is evaluated using cluster validation measures such as entropy (E), class separability (S), fuzzy feature evaluation index (FFEI), etc. [3]. Feature selection may also follow a filter-based or a wrapper-based approach. When the utility of a feature is measured in terms of some proxy measure, the approach is called filter-based feature selection. In the supervised filter-based approach, the proxy measure uses the class labels. In an unsupervised filter, the proxy measure considers the degree to which the distribution of the feature values exhibits the class structure in the feature space. Utility measures for wrapper methods [2] rely completely on a classifier or clustering result. As filter methods are independent of the classifier applied subsequently, they have excellent generalization properties, but may be less effective at decreasing the dimensionality of the feature space and boosting classification accuracy. Generally, they are computationally cheaper than the wrapper approaches, whereas wrapper-based methods are more prone to over-fitting the data. Feature selection has been addressed in a variety of ways, such as clustering based [4], [5], content based [6], for ensemble classifiers [7], graph based [8], [9] and feature similarity based [3].

In this context, two opposite strategies have been proposed in the literature: those that aim at the exclusion of redundant features [3] and those that focus on the elimination of irrelevant features [10]. Besides these methods, there exist some Particle Swarm Optimization (PSO) based feature selection techniques in the literature. In [11], a multiswarm binary PSO was introduced, in which a scheduling algorithm selects the fittest subswarm and classification accuracy and F-score are combined as the objective function. In [12], the authors used PSO and a Least Squares Support Vector Machine for feature selection, and in [13] an improved PSO with a sign test was described for identifying relevant features. Article [14] also used bPSO. All these methods, however, are modeled in a single-objective fashion, with classification accuracy as the objective function. There also exist multiobjective PSO-based approaches such as [15], [16] and [17], where MOPSO has been well studied, but they did not consider the redundancy among features, which should be minimized to reduce computation cost and improve performance. Therefore, the objective of feature selection should be to select the most significant or relevant as well as non-redundant features.

In this article we propose a novel graph-theoretic model for selecting the most relevant and non-redundant features from the input dataset. In the proposed method, a complete graph is first constructed in which the nodes represent the features and the edge weights are defined by the dissimilarity among the features. Then we extract the densest subgraph from the feature-dissimilarity graph. The features contained in the extracted subgraph constitute the final selected relevant and non-redundant features. For identifying the densest subgraph, we propose a multiobjective binary particle swarm optimization (MO-bPSO) based algorithm. The particles are encoded as binary strings representing the feature subset. Two objective functions, average node weight and average edge weight, are optimized simultaneously. Unlike single-objective optimization, which yields a single best solution, multiobjective optimization (MOO) [18], [19] algorithms produce a set of non-dominated solutions, none of which can be further improved on any one objective without degrading another. Here the multiobjective optimization problem is tackled by applying bPSO [20], in which fitness comparison takes Pareto dominance [21] into account during the movement of the particles in the search space. The non-dominated solutions are stored in an archive to approximate the Pareto front [22].

In this article, the feature selection technique is applied to identify relevant and non-redundant gene markers from microarray gene expression data [23]. Microarray is a rapidly growing technology that provides the opportunity to assay the expression levels of genes in a single experiment. A microarray gene expression data set contains the expression levels of thousands of genes over a number of tissue samples. Hence it is a sample-versus-gene matrix that also contains the class label of each sample. Although microarray analysis has recently gained popularity for finding disease-related genes or markers, its high dimensionality and noise pose a challenging problem. Moreover, some genes may not be very relevant to the corresponding class labels; hence they are not helpful for phenotype classification. In binary classification [24], the samples of the microarray dataset are classified as normal (benign) or cancer (malignant) tissue. When the samples represent three or more subtypes of cancer, the task is called multiclass cancer classification [25].

In practice, in order to find the most relevant genes, most of the existing feature selection techniques [26], [27] produce a redundant set of genes. This has encouraged us to apply our proposed graph-based multiobjective binary particle swarm optimization technique, which selects a set of genes that is not only relevant but also non-redundant. The performance of the proposed technique is evaluated on different real-life microarray gene expression data sets and compared with that of various existing gene selection techniques.

Materials and Methods 

Other Relative Methods 

Many feature selection techniques exist in the literature, each claiming its own superiority. In this article, we consider some of them, namely the T-test, Ranksum test, SFS, SBE, CFS, mRMR (MIQ), Graph-based feature selection and Cluster-based feature selection. Moreover, as our method is a multiobjective one, its single-objective versions are also taken into account. Sequential Forward Search (SFS) [28] selects features sequentially depending on the adopted criterion. In contrast, Sequential Backward Elimination (SBE) [29] discards features on the basis of the adopted criterion. Additionally, Correlation-based Feature Selection (CFS) [30] has been used for performance analysis. Here, the ratio of the SNR value to the mean correlation value is taken as the criterion for calculating a feature's importance. The number of features produced by our proposed approach is used as the input of the other comparative algorithms, i.e., the T-test, Ranksum test, SFS, SBE, CFS and mRMR (MIQ). In the case of the T-test [31] and the Ranksum test [32], [26], the p-values of the features are first sorted and the required number of features is taken for validation. In the mRMR feature selection technique [33], [34], the relevance of a gene is calculated as the mutual information [35] between the gene and the corresponding class labels, and redundancy is computed as the mutual information among the features. The basic concept of mRMR is to select genes that are both relevant and maximally dissimilar to each other. Let S denote the subset of genes that we are seeking. The average minimum redundancy is given by Equation 1:

$$\text{Minimum } W = \frac{1}{|S|^2} \sum_{i,j \in S} I(i,j), \qquad (1)$$

where $I(i,j)$ represents the mutual information between the i-th gene and the j-th gene and $|S|$ is the number of genes in S. The discriminant power of a gene is measured by the mutual information $I(h, g_i)$ between the targeted classes $h = \{h_1, h_2, \ldots, h_k\}$ and the gene expression $g_i$, which serves as the measure of relevance of that gene. Thus the maximum relevance condition is to maximize the average relevance of all genes in S, as given by Equation 2:

$$\text{Maximum } V = \frac{1}{|S|} \sum_{i \in S} I(h,i). \qquad (2)$$

Therefore, the redundancy of a gene has to be minimized and its relevance has to be maximized. As the two conditions are equally important, the two simplest combined criteria are $\max(V - W)$ and $\max(V / W)$. Here only the mRMR for discrete variables in the form of the mutual information quotient (mRMR MIQ) is described. The mRMR MIQ scheme is formulated as per Equation 3:

$$\text{mRMR}_{\text{MIQ}} = \max_{i \in \Omega_S} \left\{ \frac{I(i,h)}{\frac{1}{|S|} \sum_{j \in S} I(i,j)} \right\}. \qquad (3)$$
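
The quotient criterion of Equation 3 can be illustrated with a short Python sketch. It is not the implementation used in the experiments: the greedy one-gene-at-a-time search, the function names `mutual_information` and `mrmr_miq`, and the small constant added to the denominator are assumptions made here for illustration, and the genes are assumed to be already discretized.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in pab.items()
    )


def mrmr_miq(genes, labels, k):
    """Greedy mRMR-MIQ sketch: genes is a list of discrete columns, labels the class vector."""
    relevance = [mutual_information(g, labels) for g in genes]
    # Start from the single most relevant gene.
    selected = [max(range(len(genes)), key=lambda i: relevance[i])]
    while len(selected) < k:
        best, best_score = None, -math.inf
        for i in range(len(genes)):
            if i in selected:
                continue
            # Average mutual information with the genes already selected (redundancy).
            redundancy = sum(
                mutual_information(genes[i], genes[j]) for j in selected
            ) / len(selected)
            # Assumption: a tiny constant guards against division by zero.
            score = relevance[i] / (redundancy + 1e-12)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

With a duplicated gene in the input, the quotient favors an equally relevant gene that shares no information with the one already chosen, which is the behavior the criterion is meant to produce.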

Next, in the Graph-based feature selection method [36], a graph G = (V, E) is constructed with node set V, edge set $E \subseteq V \times V$ and edge weight matrix W whose elements lie in the interval [0, 1]. Each vertex represents a feature and the edge between two features represents their pairwise relationship. The weight on an edge reflects the degree of relevance between the two features. The algorithm proceeds as follows: a) compute the relevance matrix $W = (w_{ij})_{n \times n}$ based on the mutual information between feature vectors, b) apply dominant-set clustering to cluster the feature vectors, and c) select the optimal feature set from each dominant set using the multidimensional interaction information (MII) criterion. In the Cluster-based feature selection method [37], the feature set is partitioned into clusters of similar features, where the number of clusters and the cardinality of the subset of selected features are automatically estimated from the data. However, this method relies on some user-defined parameters.

Multiobjective Optimization (MOO) and Problem Description

In this section, the basic concepts of multiobjective optimization are described first. Subsequently, the gene selection problem is formulated as a multiobjective optimization problem.

MOO concepts. In many real-world problems, there exist different aspects of solutions which are partially or wholly in conflict. Treating such problems as single-objective optimization therefore produces unreliable results. In a multiobjective optimization problem, the objectives may estimate those different, conflicting aspects of the solutions. Multiobjective optimization can formally be stated as follows [18], [19]. Find the vector $x^* = [x_1^*, x_2^*, \ldots, x_n^*]^T$ of decision variables which satisfies m inequality constraints:

$$g_i(x) \geq 0, \quad i = 1, 2, \ldots, m, \qquad (4)$$

and p equality constraints:

$$h_i(x) = 0, \quad i = 1, 2, \ldots, p, \qquad (5)$$

and optimizes the vector function:

$$f(x) = [f_1(x), f_2(x), \ldots, f_k(x)]^T. \qquad (6)$$

The constraints in Equations 4 and 5 define the feasible region $\mathcal{F}$, which contains all the allowable solutions. Any solution outside this region is inadmissible since it violates one or more constraints. The vector $x^*$ denotes an optimal solution in $\mathcal{F}$.

The essence of multiobjective optimization can be captured through Pareto optimality [21]. The Pareto optimal set comprises all those solutions for which it is impossible to improve any objective without simultaneously worsening some other objective. Formally, for a minimization problem, a vector of decision variables $x^* \in \mathcal{F}$ is Pareto optimal if there does not exist another $x \in \mathcal{F}$ such that $f_i(x) \leq f_i(x^*)$ for all $i = 1, \ldots, k$ and $f_j(x) < f_j(x^*)$ for at least one j. Here, $\mathcal{F}$ denotes the feasible region of the problem (i.e., where the constraints are satisfied). The Pareto optimal set [22] generally contains more than one solution because there exist different 'trade-off' solutions to the problem with respect to the different objectives. The solutions contained in the Pareto optimal set are called non-dominated solutions. The plot of the objective function values of the non-dominated vectors in the Pareto optimal set is called the Pareto front [22]. MOO is thus a process of generating the whole Pareto front or an approximation to it.
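
The dominance relation used in this definition can be stated compactly in code. The following Python sketch assumes minimization of every objective; the function names `dominates` and `non_dominated` are illustrative and do not come from the cited works.

```python
def dominates(f_a, f_b):
    """True if objective vector f_a Pareto-dominates f_b (all objectives minimized)."""
    no_worse = all(a <= b for a, b in zip(f_a, f_b))
    strictly_better = any(a < b for a, b in zip(f_a, f_b))
    return no_worse and strictly_better


def non_dominated(points):
    """Return the objective vectors that are not dominated by any other vector."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For example, among the vectors (1, 4), (2, 2), (4, 1), (3, 3) and (2, 5), the last two are dominated by (2, 2) and (1, 4) respectively, so only the first three approximate the Pareto front.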

Problem description. In this article the target is to find non-redundant but relevant features from a data matrix. In other words, the resultant features should be not only uncorrelated but also significant. So the problem should be defined in such a manner that correlated and irrelevant features are not selected. In our proposed scheme, the problem is equivalent to finding the densest subgraph of a weighted undirected graph. The data matrix can be viewed as a two-dimensional matrix whose rows indicate instances and whose columns indicate attributes or features. One additional column holds the corresponding class labels of the instances. Available similarity/dissimilarity measures include the correlation coefficient [38], Euclidean distance [39] and maximal information compression index [3]. Using one of these dissimilarity (negative similarity) measures, a symmetric matrix is generated, which is termed the dissimilarity matrix. Let the data set have n features, $F = \{f_1, f_2, f_3, \ldots, f_n\}$. Calculating the pairwise negative similarity between the features of the feature set F yields an (n × n) symmetric dissimilarity matrix $S_m$. From this dissimilarity matrix $S_m$ a weighted complete graph G can be formed. Since a node represents a feature, the vertex set of the graph G is $V = \{f_1, f_2, f_3, \ldots, f_n\}$, i.e., the graph contains n nodes in total. The value at row i and column j of the dissimilarity matrix $S_m$ represents the weight of the edge between nodes $f_i$ and $f_j$. As each feature has some dissimilarity value with every other feature (present in the symmetric dissimilarity matrix $S_m$), the graph G is a complete graph. Fig. 1 demonstrates the process of conversion from the data matrix to the feature-dissimilarity graph. First, the dissimilarity matrix (for the edge weights) is calculated from the data matrix using the correlation coefficient between each pair of genes. The correlation coefficient $\sigma$ between two random variables x and y can be defined as [38]:



$$\sigma(x,y) = \frac{\mathrm{cov}(x,y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}, \qquad (7)$$



where var() denotes the variance of a variable and cov(x, y) the covariance between the two variables. If x and y are completely correlated, i.e., an exact linear dependency exists, then $\sigma(x,y)$ is 1 or −1, and if they are totally uncorrelated then $\sigma(x,y)$ is 0. Hence $(1 - |\sigma(x,y)|)$ represents the dissimilarity between x and y. Subsequently, a graph G is formulated from the dissimilarity matrix. Let the samples belong to either class 1 (denoted by c1) or class 2 (denoted by c2). Then the signal-to-noise ratio (SNR) value (node weight) corresponding to each feature ($f_i$) is calculated using the mean and standard deviation (s.d.) of the class 1 samples (c1) and class 2 samples (c2), and is defined as [40]:



$$|SNR_i| = \left| \frac{\mathrm{mean}(f_i(c1)) - \mathrm{mean}(f_i(c2))}{\mathrm{s.d.}(f_i(c1)) + \mathrm{s.d.}(f_i(c2))} \right|. \qquad (8)$$



The SNR is the ratio of the difference between the class means to the sum of the standard deviations of the two classes of samples. In other words, it relates the separation of the central tendencies of the two classes to the dispersion of the data points around their class averages. A low SNR indicates that the feature does not take very different values in the different classes, whereas a high SNR indicates that the feature values differ between the classes relative to their spread. A feature with a very low SNR may be considered insignificant with respect to the class labels, and a high SNR value means that the feature is highly differentially expressed. Therefore the SNR value is treated as the feature relevance. For the graph G, a larger edge weight means that the features connected by that edge are more dissimilar, and a larger node weight means that the feature is more relevant. Thus finding the densest subgraph g of graph G is equivalent to finding the non-redundant and most relevant feature set, as the features (nodes) enclosed by the subgraph g will have maximum average edge weight (dissimilarity) and maximum average node weight (SNR). Therefore the problem can be defined as: find the densest subgraph (g)
Figure 1. Construction of the feature-dissimilarity graph. From the m × n data matrix, the relevance vector and the n × n dissimilarity matrix are first computed; then a weighted complete feature-dissimilarity graph of n vertices is constructed. Here an example graph of 5 features is depicted.
doi:10.1371/journal.pone.0090949.g001



from a complete weighted graph G. Thus the features present in 
the reduced subgraph g are the required output of our proposed 
technique. Here we have developed a multiobjective bPSO to 
address this problem. 
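
The construction of the weighted graph from a data matrix, following Equations 7 and 8, can be sketched in Python as below. This is a minimal illustration, not the code used in the experiments: the function name `build_graph`, the list-of-columns data layout, the class labels 1 and 2, and the use of the population standard deviation are assumptions, and constant-valued features (zero variance) are not handled.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std(values):
    # Assumption: population standard deviation (divides by the sample count).
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(x, y):
    """Correlation coefficient of Equation 7."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def build_graph(columns, labels):
    """Return (node weights, edge weight matrix) of the feature-dissimilarity graph.

    columns: one list of expression values per feature; labels: 1 or 2 per sample.
    """
    node_w = []
    for col in columns:
        c1 = [v for v, lab in zip(col, labels) if lab == 1]
        c2 = [v for v, lab in zip(col, labels) if lab == 2]
        # |SNR| of Equation 8 as node weight (relevance).
        node_w.append(abs((mean(c1) - mean(c2)) / (std(c1) + std(c2))))
    n = len(columns)
    # Dissimilarity 1 - |correlation| as edge weight; the diagonal is set to 0.
    edge_w = [
        [0.0 if i == j else 1.0 - abs(correlation(columns[i], columns[j]))
         for j in range(n)]
        for i in range(n)
    ]
    return node_w, edge_w
```

Two perfectly correlated features receive an edge weight of 0, so a subgraph containing both is penalized on average edge weight, whereas an uncorrelated pair receives the maximum weight of 1.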

Proposed Multiobjective Binary PSO-based Approach 

Particle Swarm Optimization (PSO) [41], [42] is a well-known swarm-based optimization technique which optimizes a problem by iteratively trying to improve candidate solutions with respect to a given fitness measure. In PSO, a set of particles, or candidate solutions, traverses the search space with a velocity based on their own experience and the experience of their neighbors. During each traversal, the velocity and thereby the position of each particle are updated. This process is repeated until some stopping criteria are met. Unlike classical optimization techniques, which tend to converge prematurely to a locally optimal solution, PSO is known for its global search capability.

In this article, the input data matrix is first transformed into a weighted undirected complete feature graph, where the nodes (having relevance as node weight) represent the genes and the edges are weighted according to the dissimilarity of the genes. In each iteration, a reduced subgraph is computed for which the average relevance and the average dissimilarity among the genes it contains are maximized. Thus the densest subgraph, having maximum average weight (node + edge), is identified by applying binary PSO [20]. The bPSO is applied to multiobjective optimization, and with the help of non-dominated sorting [43] and the crowding distance measure [18], a small set of non-redundant informative genes is identified.

Particle encoding. Here the population is called the swarm and it consists of m candidate solutions, or particles. Each particle has n cells, where n is the total number of genes in the data matrix, i.e., each cell represents one gene of the data matrix. A cell can take the value 0 or 1. If the i-th cell of a particle has value 1, then the i-th gene is selected from the dataset; otherwise it is ignored.

Initialization. Initially each cell of a particle is set to 0 or 1 at random. After the initial particles are chosen, their corresponding fitness values are calculated. Then the velocity of each cell of the particle is initialized to zero. For each dataset, the algorithm is executed for 100 iterations. The swarm size is set to 25, and the weighting factors c1 and c2, which are the cognitive and social parameters respectively, are both set to 2.

Fitness computation. Here two objectives, average dissimilarity (negative correlation) and average signal-to-noise ratio, are maximized. Each particle forms a reduced subgraph for which the average negative correlation (avg_ncorr) and the average SNR value (avg_snr) are computed. As the bPSO algorithm is designed as a minimization problem, the fitness values are computed as (1 − avg_ncorr) and (1 − avg_snr). The cells are then iterated as in the usual PSO evaluation [44]. To calculate the fitness values of a particle, those genes are selected whose representing cells have value 1. These selected genes form a subgraph g[v, e, vw, ew], where v is the set of nodes, e is the set of edges, vw is the vector of node weights obtained by computing the SNR value of each node, and ew is the edge weight matrix calculated as (1 − |correlation|) between each pair of nodes. Thereafter, avg_ncorr (Equation 9) and avg_snr (Equation 10) are defined as

$$avg\_ncorr = \frac{\sum_{i<j} ew_{ij}}{|v|\,(|v|-1)/2}, \qquad (9)$$

$$avg\_snr = \frac{\sum_{i=1}^{|v|} vw_i}{|v|}. \qquad (10)$$
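
The two fitness values of a particle can be computed as in the following Python sketch. The function name `fitness` is illustrative, and returning the worst value (1, 1) for particles that select fewer than two genes is an assumption made here, since a subgraph without edges has no average edge weight.

```python
def fitness(particle, node_w, edge_w):
    """Return (1 - avg_ncorr, 1 - avg_snr) for a binary particle; both are minimized."""
    idx = [i for i, bit in enumerate(particle) if bit == 1]
    if len(idx) < 2:
        # Assumption: a subgraph with no edge gets the worst fitness.
        return (1.0, 1.0)
    n_edges = len(idx) * (len(idx) - 1) / 2
    # Sum edge weights over each unordered pair of selected genes (Equation 9).
    edge_sum = sum(edge_w[i][j] for a, i in enumerate(idx) for j in idx[a + 1:])
    avg_ncorr = edge_sum / n_edges
    # Average node weight of the selected genes (Equation 10).
    avg_snr = sum(node_w[i] for i in idx) / len(idx)
    return (1.0 - avg_ncorr, 1.0 - avg_snr)
```

For three genes with node weights 0.9, 0.5 and 0.1, the particle [1, 0, 1] is evaluated only on the edge between the first and the third gene and on the mean of their two node weights.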



Updating position and velocity. As each cell represents one gene, the two terms cell and gene are used interchangeably here. The position of a gene within a particle is either 0 or 1, and the velocity of each gene is initialized to zero. Using the information obtained from the previous step, the position and velocity of each particle are updated. Each particle keeps track of the best position it has achieved so far, which is called pbest or local best. In the multiobjective setting, the position chosen as pbest is the one whose fitness dominates the other fitnesses acquired by that particle in its history; if there is no such fitness, a random choice is made between the current and the previous position of that particle. The best position among all the particles is called global best or gbest, and it is chosen at random from the archive of non-dominated candidate solutions. Whenever a particle moves to a new position with a velocity, its position and velocity are altered according to Equations 11 and 12 given below [20]:



$$v_{ij}(t+1) = w \cdot v_{ij}(t) + c_1 \cdot r_1 \cdot (pbest_{ij}(t) - x_{ij}(t)) + c_2 \cdot r_2 \cdot (gbest_{ij}(t) - x_{ij}(t)), \qquad (11)$$

$$x_{ij}(t+1) = x_{ij}(t) + v_{ij}(t+1). \qquad (12)$$



Here t is the time stamp, and the i-th particle and j-th position are considered. In Equation 11 the new velocity ($v_{ij}(t+1)$) is obtained from the velocity at the previous time step ($v_{ij}(t)$), pbest and gbest. The new position ($x_{ij}(t+1)$) is then obtained by adding the new velocity to the current position ($x_{ij}(t)$), as shown in Equation 12. $r_1$ and $r_2$ are two random values in the range 0 to 1, and w is the inertia weight, which is computed as per Equation 13:



$$w = \left(1.1 - \frac{gbest}{pbest}\right). \qquad (13)$$
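
One update step of a single particle, combining Equations 11 and 12 with the threshold-based discretization listed in Algorithm 1 (thr = 0.9), may be sketched in Python as follows. The function name `update_particle` and its signature are assumptions for illustration; the inertia weight w is passed in by the caller rather than computed inside the function.

```python
import random


def update_particle(x, v, pbest, gbest, w, c1=2.0, c2=2.0, thr=0.9, rng=random):
    """Apply one velocity/position update to a binary particle.

    x, pbest, gbest: lists of 0/1 cells; v: list of real-valued velocities.
    Returns the new binary position and the new velocity.
    """
    new_x, new_v = [], []
    for xj, vj, pj, gj in zip(x, v, pbest, gbest):
        r1, r2 = rng.random(), rng.random()
        # Equation 11: inertia, cognitive and social components.
        vj_next = w * vj + c1 * r1 * (pj - xj) + c2 * r2 * (gj - xj)
        # Equation 12 followed by discretization against the threshold.
        xj_next = xj + vj_next
        new_v.append(vj_next)
        new_x.append(1 if xj_next >= thr else 0)
    return new_x, new_v
```

When a particle already coincides with both pbest and gbest and has zero velocity, the cognitive and social terms vanish and the particle stays where it is, which is a convenient sanity check for any implementation.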



Updating archive. The repository in which the non-dominated solutions found so far are kept is called the archive. First the archive A is initialized with the non-dominated solutions of the population $P_i$. Next, to update the archive, the next-generation population $P_{i+1}$ is merged with the archive $A_i$, i.e., $A_{i+1} = A_i + P_{i+1}$, and the non-dominated solutions are then obtained by applying non-dominated sorting and crowding distance sorting to the combined archive $A_{i+1}$. Sorting the combined population in this way yields better diversity of the Pareto optimal front.
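
The archive update can be illustrated with the Python sketch below. It is a simplified stand-in rather than the exact procedure of [43] and [18]: the archive stores only fitness tuples (real implementations keep the particles alongside their fitness), only the first non-dominated front is retained, and the size limit `max_size` is a hypothetical parameter introduced here.

```python
import math


def dominates(f_a, f_b):
    """True if f_a Pareto-dominates f_b (all objectives minimized)."""
    return all(a <= b for a, b in zip(f_a, f_b)) and any(
        a < b for a, b in zip(f_a, f_b)
    )


def crowding_distance(front):
    """Crowding distance of each fitness tuple in a non-dominated front."""
    n = len(front)
    dist = [0.0] * n
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        lo, hi = front[order[0]][m], front[order[-1]][m]
        # Boundary solutions are always kept.
        dist[order[0]] = dist[order[-1]] = math.inf
        if hi == lo:
            continue
        for k in range(1, n - 1):
            gap = front[order[k + 1]][m] - front[order[k - 1]][m]
            dist[order[k]] += gap / (hi - lo)
    return dist


def update_archive(archive, population, max_size):
    """Merge, keep the non-dominated tuples, and prefer the least crowded ones."""
    merged = list(dict.fromkeys(archive + population))  # drop exact duplicates
    front = [p for p in merged if not any(dominates(q, p) for q in merged)]
    if not front:
        return []
    dist = crowding_distance(front)
    ranked = sorted(range(len(front)), key=lambda i: dist[i], reverse=True)
    return [front[i] for i in ranked[:max_size]]
```

Ranking by decreasing crowding distance keeps the boundary solutions and discards solutions from densely populated regions first, which is what spreads the archive along the front.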

Proposed MObPSO algorithm. Here, the proposed multiobjective binary particle swarm optimization (MObPSO) is designed to maximize the dissimilarity (negative correlation) and the SNR, which are represented as edge weight and node weight, respectively. The adopted graph-based MObPSO technique is presented as Algorithm 1 in Table 1. The population is initialized with arbitrarily selected features from the data matrix, and the population fitness values are calculated using Equation 9 and Equation 10. The archive A is initialized with the population after non-dominated sorting of the initial population. Velocity and position are updated using Equations 11 and 12, respectively. The local best P is updated by comparing the current fitness and the previous fitness of a particle, and the global best G is updated by randomly picking a particle from the archive. After updating the position and velocity, the next-generation solutions are added to the archive, and then non-dominated sorting [43] and crowding distance [18] sorting are used to revise the extended archive. These steps are repeated for a fixed number of iterations.

Results and Discussion 

In this section, we first describe the real-life datasets and their preprocessing procedure, then present the performance metrics, followed by the results of the different algorithms.

Datasets and Preprocessing 

In this article three real-life gene expression datasets are used, which are publicly available from the following website: www.biolab.si/supp/bi-cancer/projections/info/.



Prostate. Gene expression measurements for samples of prostate tumors and adjacent prostate tissue not containing tumor were used to build this classification model. It contains 50 normal tissue samples and 52 prostate tumor samples. The expression matrix consists of 12533 genes and 102 samples.

DLBCL. Diffuse large B-cell lymphomas (DLBCL) and follicular lymphomas (FL) are two B-cell lineage malignancies that have very different clinical presentations, natural histories and responses to therapy. The dataset contains 7070 genes in total. The number of samples of type DLBCL is 58 and of type FL is 19.

GSE412 (Child-ALL). The childhood ALL dataset (GSE412) includes gene expression information on 110 childhood acute lymphoblastic leukemia samples. The dataset has 50 examples of type before therapy and 60 examples of type after therapy. The number of genes is 8280.

The two-class datasets described above are available in matrix format, with genes as columns and samples as rows, and are preprocessed by computing the SNR (Equation 8) for each gene (column). The genes (columns) of the data matrix are sorted in decreasing order of the obtained |SNR|. Finally, the top 100 genes are taken from the data matrix. After that, the data matrix is normalized so that each gene expression value lies in the range 0 to 1.
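
The preprocessing steps can be sketched in Python as follows. The text does not state which normalization is used, so per-gene min-max scaling is an assumption here, as are the function name `preprocess` and the convention of mapping a constant gene to 0.

```python
def preprocess(columns, snr, top_k=100):
    """Keep the top_k genes by |SNR| and scale each kept gene to the range [0, 1].

    columns: one list of expression values per gene; snr: one SNR value per gene.
    Returns the indices of the kept genes and their normalized columns.
    """
    order = sorted(range(len(columns)), key=lambda i: abs(snr[i]), reverse=True)
    kept = order[:top_k]
    normalized = []
    for i in kept:
        lo, hi = min(columns[i]), max(columns[i])
        span = hi - lo
        # Assumption: min-max scaling per gene; a constant gene maps to 0.
        normalized.append([(v - lo) / span if span else 0.0 for v in columns[i]])
    return kept, normalized
```

Because the ranking uses the absolute value, a gene with SNR −2.0 is ranked ahead of a gene with SNR 0.5.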

Score Analysis 

Performance is evaluated using sensitivity, specificity, accuracy, F-score, AUC and average correlation. The entire dataset is divided into two different sets: a training set and a test set. The proposed approach is applied to the training data, which yields a set of non-dominated candidate solutions. After that, for the final selection of marker genes, we employ the BMI-score [45], which considers the discriminative power of each gene by incorporating the true positive rate from logistic regression. In mathematical terms, let us assume a data set D consisting of two groups, 'control (ctr)' and 'experiment (exp)'. BMI assigns a score to a feature x, defined as follows:



BMI{x) = X.TF^\\k 



CVctr 

cv ' 



(14) 



where 



ifA> = l 
otherwise 



(15) 



Here, λ is a scaling factor and $TP^2$ is the product of the true positive (TP) rates determined for each group using logistic regression. $CV_{ctr}$ and CV denote the coefficient of variation of the feature x in the 'control' group and in both groups, respectively. Also, $A = \bar{x}/\bar{x}_{ctr}$, where $\bar{x}_{ctr}$ and $\bar{x}$ denote the mean value of x in the 'control' group and in both groups, respectively. The candidate solution generating the maximum BMI-score is considered the most informative solution. The performance of the proposed algorithm is compared with that of its single-objective versions and two statistical tests, the T-test and the Wilcoxon Ranksum test. The datasets are arbitrarily divided into two sets: a training set and a test set. This process is repeated 10 times, giving 10 training sets and their corresponding 10 test sets. Each algorithm is executed on each training set and evaluated on the corresponding test set. Thus for each algorithm we obtain 10 sensitivity, 10 specificity, 10 accuracy and 10 F-score values. The average of these 10 values, together with the standard deviation, is then computed for each performance metric and tabulated.
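
The per-split scores and their summary over repeated splits can be computed as in the following Python sketch. The function names `binary_scores` and `summarize` are illustrative, the positive class is assumed to be labeled 1, and the use of the sample standard deviation is an assumption, since the text does not say which variant is reported.

```python
import statistics


def binary_scores(y_true, y_pred, positive=1):
    """Sensitivity, specificity, accuracy and F-score for one train/test split."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Returns 0 when a rate is undefined (empty denominator).
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": ratio(tp + tn, len(pairs)),
        "fscore": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


def summarize(values):
    """Mean and sample standard deviation of a metric over repeated splits."""
    return statistics.mean(values), statistics.stdev(values)
```

Calling `binary_scores` once per test set and passing the ten values of each metric to `summarize` reproduces the kind of mean and standard deviation entries that are tabulated.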
Table 1. Algorithm 1: Graph-based MObPSO (Minimization Problem).

Input: data matrix dt, C = number of genes, N = number of particles, threshold thr = 0.9, graph G = [V, E, VW, EW] designed from dissimilarity matrix S_m.
Output: archive A

1: [x_n, v_n, G_n, P_n], n = 1..N := initialize(dt)    ▷ random locations and velocities
2: g_n[V1, E1, VW1, EW1] = S_m(V1_n × V1_n)    ▷ subgraphs for N particles are formed from dissimilarity matrix S_m
3: f1 = 1 − (Σ EW1) / (|V1|·(|V1| − 1)/2)    ▷ average dissimilarity value for the subgraphs
4: f2 = 1 − (Σ VW1) / |V1|    ▷ average SNR value for the nodes contained by N subgraphs
5: A := {x_n : no u ∈ A with fitnesses(u) ≺ fitnesses(x_n)}    ▷ initialize archive A by first non-dominated x_n
6: for n := 1 : N do
7:     for l := 1 : C do
8:         v_{n,l} := w·v_{n,l} + r1·(P_{n,l} − x_{n,l}) + r2·(G_{n,l} − x_{n,l})
9:         x_{n,l} := x_{n,l} + v_{n,l}
10:        if x_{n,l} >= thr then
11:            x_{n,l} := 1    ▷ discretize the cell value
12:        else
13:            if x_{n,l} < thr then
14:                x_{n,l} := 0
15:            end if
16:        end if
17:    end for
18: end for
19: for n := 1 : N do
20:    g_n[V1, E1, VW1, EW1] = S_m(V1_n × V1_n)    ▷ new subgraph produced by the evaluated particles
21:    f1 = 1 − (Σ EW1) / (|V1|·(|V1| − 1)/2)    ▷ average dissimilarity value for the new subgraph
22:    f2 = 1 − (Σ VW1) / |V1|    ▷ average SNR value for the nodes contained by the new subgraph
23:    A := A ∪ x_n    ▷ add x_n to A
24:    for k := 1 : N do
25:        if (fitnesses(x_n) ≺ fitnesses(P_n)) then    ▷ update personal best
26:            P_n := x_n
27:            if non-dominated fitnesses then
28:                P_n := Random-choice[x_n, P_n]
29:            end if
30:        end if
31:        G_n := random-select(A)
32:    end for
33: end for
34: A := {x_n : no u ∈ A with fitnesses(u) ≺ fitnesses(x_n)}    ▷ non-dominated sorting is applied to the updated archive
35: CrowdingSort(A)    ▷ crowding distance sorting for archive
36: Steps 6 to 33 are repeated according to the number of iterations
With respect to the Prostate data, it is evident from Table 2 that for each score metric the proposed method (0.8962, 0.9, 0.898, 0.9002, 0.964) outperforms the single-objective versions, T-test, Ranksum test, SFS and SBE. Regarding sensitivity, our method is better than the Graph-based and Cluster-based methods but differs only slightly from CFS and mRMR (MIQ). With respect to specificity, the performance is average. In terms of accuracy and F-score, the proposed method is better than mRMR (MIQ) and the Cluster-based method but not as good as CFS and the Graph-based method. The AUC produced by the proposed method is 0.964, which is better than that of all the other methods. The average correlation produced by our method is 0.4714, which, except for the T-test, is lower than that of the other
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Table 4. Gene Markers Identified by the Proposed Method for Various Dataset. 





Data set 


Gene ID 


Symbol 


Description 


Up or Down 




Prostate 


37639_ar 


HPN 


Hepsin 


up 


Cancer 


32243_g_a; 


CRYAB 


crystallin, alpha B 


up 




33904_ar 


CLDN3 


claudin 3 


up 




41504j_fl* 


MAP 


v-maf musculoaponeurotic fibrosarcoma oncogene homolog 


up 




40435_ar 


SLC25A6 


solute carrier family 25, member 6 


down 




33614_ar 


RPL18A, 


ribosomal protein LI 8a, LI 8a pseudogene 3 


down 


RPL18AP3 


DLBCL 


X02l52_at 


LDHA 


lactate dehydrogenase 


down 




A/14328j_ar 


ENOl 


enolase 1 (alpha) 


down 




!759309_a; 


FH 


fumarate hydratase, mitochondrial precursor 


down 


Child 


41117j_o/ 


SLC9A3R2 


solute carrier family 9, isoform 3 regulator 2 


down 


ALL 


37226_flr 


BNIPl 


BCL2/adenovirus ElB 19 KDa interacting protein 1 


down 




33069/_ar 


UGT2B15 


UDP glucuronosyl transferase 2 family, polypeptide B15 


down 




34757_flr 


PARP2 


poly (ADP-ribose) polymerase 2 


down 




39335_ar 


EIF5AL1, 


eukaryotic translation initiation factor 5A-likel and 5A 


down 


EIF5A 
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methods. This indicates that non-correlated genes are identified. 
As population-based optimization techniques take more time to 
execute, therefore time complexity of our method is 81.176 Sec. 
which is not so high than other comparative methods. 

For the DLBCL data, Table 2 shows that with respect to average sensitivity, fscore and AUC, our proposed technique (0.9111, 0.8428 and 0.9644) scores better than all the other methods. With respect to specificity, the proposed method scores better than all but CFS and the Graph-based method. The accuracy produced by our method is also better than that of the others, except the Graph-based method. Although CFS and the Graph-based method yield less correlated genes, their sensitivity is very poor. The execution time of the proposed method is higher than that of the others, but the difference is not large.
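The average correlation quoted for each method measures how redundant the selected genes are. One plausible way to compute such a score is sketched below in Python. The exact definition used for the tables is not given in this section, so the use of the absolute Pearson coefficient averaged over all gene pairs is an assumption, as are the function names and the toy expression values.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)


def average_correlation(expr, selected):
    """Mean absolute pairwise correlation among the selected genes.

    expr[g] is the expression profile of gene g across all samples.
    Taking the absolute value is an assumption of this sketch.
    """
    pairs = list(combinations(selected, 2))
    if not pairs:
        raise ValueError("need at least two selected genes")
    return sum(abs(pearson(expr[i], expr[j])) for i, j in pairs) / len(pairs)


# Toy expression profiles (hypothetical): gene 1 is a scaled copy of gene 0,
# gene 2 is only weakly related to gene 0.
expr = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, -1.0, 1.0, -1.0],
]
redundant = average_correlation(expr, [0, 1])
diverse = average_correlation(expr, [0, 2])
```

Under this definition a lower value means the selected genes carry less overlapping information, which is the sense in which a low average correlation is read as non-redundancy in the discussion above.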

Moreover, for the Child-ALL data, it is clear from Table 2 that the proposed scheme is superior in terms of sensitivity and accuracy. With respect to average specificity, however, its score is 0.8233, which is not better than that of single-objective (SNR), Ranksum test, SFS, CFS, mRMR (miq) and the Graph-based method. With respect to fscore and AUC, the proposed method produces a better score than the others most of the time. The average correlation of the proposed method is 0.7324, which is lower than that of the others except CFS. Hence the proposed technique yields better values on most metrics, which supports the superiority of our proposed technique.

[Heatmap: sample groups Normal Tissue and Prostate Tumor; legible row labels 32243_g_at, 37639_at]
Figure 2. The heatmap of the gene markers for Prostate Cancer data. The heatmap describes the expression levels of the four up-regulated and two down-regulated gene markers for the normal and cancerous types in the Prostate Cancer data.
doi:10.1371/journal.pone.0090949.g002

[Heatmap: sample groups DLBCL Tissue and FL Tissue; row labels U59309_at, M14328_s_at, X02152_at]
Figure 3. The heatmap of the gene markers for DLBCL data. The heatmap describes the expression levels of the three down-regulated gene markers for the DLBCL and FL types in the DLBCL data.
doi:10.1371/journal.pone.0090949.g003

PLOS ONE | www.plosone.org | March 2014 | Volume 9 | Issue 3 | e90949

Cross-Validation Performance 

The performance analysis is extended using 10-fold cross-validation. All the algorithms are executed on the complete sample-versus-gene dataset, and the output genes are validated by 10-fold cross-validation using a Support Vector Machine (SVM). The cross-validation scores of the different algorithms are reported in Table 3.

It is clear from the table that for the prostate dataset, with respect to sensitivity, specificity, accuracy and fscore, the proposed method outperforms the other methods except CFS. With respect to AUC, our method is better than CFS, mRMR (miq), Graph-based and Cluster-based. The average correlation for our method is much lower than that of the other methods, i.e., the proposed method yields more non-redundant features than the comparative methods. However, it is evident from the table that it takes more time to execute than the others.

For the DLBCL dataset, with respect to sensitivity, accuracy, fscore and AUC, the proposed method performs best among all the methods. With respect to specificity, the proposed method performs slightly worse than single-objective (SNR), T-test, Ranksum test, CFS and the Graph-based method. The average correlation produced by the proposed technique is lower than that of the other methods except mRMR (miq). It can also be noticed from the table that the execution time of the proposed method is 3.6832 seconds, but the difference from the other methods is small.

For the Child-ALL dataset, with respect to accuracy, fscore and AUC, the proposed method performs better than the other comparative methods. With respect to sensitivity, the score is average and lower than that of the other methods. The specificity scored by the proposed technique is 0.719, which is considerably better than that of the other methods except the Graph-based method. The proposed method produced 0.6764 as average correlation, which is lower than that of the other methods except CFS.
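The scores discussed above are standard confusion-matrix quantities. The following Python sketch shows how sensitivity, specificity, accuracy and fscore could be computed from pooled cross-validation predictions, together with a simple stratified fold assignment. The classifier itself is left out, the round-robin fold assignment and the reading of fscore as the harmonic mean of precision and sensitivity are assumptions, and all names and labels are hypothetical.

```python
def stratified_folds(labels, k=10):
    """Assign sample indices to k folds, round-robin within each class."""
    folds = [[] for _ in range(k)]
    seen = {}  # number of samples already placed, per class label
    for idx, label in enumerate(labels):
        pos = seen.get(label, 0)
        folds[pos % k].append(idx)
        seen[label] = pos + 1
    return folds


def binary_scores(y_true, y_pred, positive=1):
    """Sensitivity, specificity, accuracy and fscore for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(pairs)
    denom = precision + sensitivity
    fscore = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "fscore": fscore,
    }


# Toy labels and pooled predictions (hypothetical values).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
folds = stratified_folds(y_true, k=2)
scores = binary_scores(y_true, y_pred)
```

In an actual run, each fold would be held out in turn, a classifier such as an SVM would be trained on the remaining folds using only the selected genes, and the held-out predictions would be pooled before calling `binary_scores`.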

Gene Marker Analysis 

After executing the proposed technique 10 times, we obtained 10 feature sets. We then took as markers those genes that appear at least 5 times in the 10 feature sets. Table 4 lists the gene marker ID, symbol and description for the three datasets. Many of these gene markers have already been validated in the existing literature as being associated with the respective cancer classes. For the prostate cancer data, the genes 32243_g_at (CRYAB) and 33904_at (CLDN3) have been reported in [46], and 37639_at (HPN) and 41504_s_at (MAF) have been reported in [47]. The genes X02152_at (LDHA) and M14328_s_at (ENO1) of DLBCL have been reported in [48]. In [49], the genes 41117_s_at (SLC9A3R2) and 33069_f_at (UGT2B15) of the Child-ALL data are reported.

[Heatmap: sample groups Child ALL after therapy and Child ALL before therapy; row labels 39335_at, 34757_at, 33069_f_at, 37226_at, 41117_s_at]
Figure 4. The heatmap of the gene markers for Child-ALL data. The heatmap describes the expression levels of the five down-regulated gene markers for after and before therapy in the Child-ALL data.
doi:10.1371/journal.pone.0090949.g004

Fig. 2, Fig. 3 and Fig. 4 show the heatmaps of the feature sets identified by our proposed technique for the prostate, DLBCL and Child-ALL datasets, respectively. The heatmaps show the gene-versus-sample matrix. The cells of each heatmap represent the expression levels of the genes in terms of colors: red shades represent high expression levels, green shades represent low expression levels, and colors towards black represent medium expression values. It is evident from Figs. 2, 3 and 4 that the gene markers for each tumor subtype have either high expression values (up-regulated) or low expression values (down-regulated) over all the samples of the respective tumor class. From Fig. 2, it is clear that the genes 37639_at (HPN), 32243_g_at (CRYAB), 33904_at (CLDN3) and 41504_s_at (MAF) are up-regulated (high expression value in normal tissue and low expression in tumor tissue) and the genes 40435_at (SLC25A6) and 33614_at (RPL18A) are down-regulated (vice versa). It can be seen from Fig. 3 that the genes X02152_at (LDHA), M14328_s_at (ENO1) and U59309_at (FH) are all down-regulated when DLBCL is compared to FL. For the Child-ALL data, all genes are down-regulated, since Fig. 4 depicts high expression values in the before-therapy class and low expression values in the after-therapy class.
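The marker selection rule used in this analysis, keeping a gene if it appears in at least 5 of the 10 feature sets, can be expressed as a short frequency count. The Python sketch below is illustrative only: the function name and the toy runs are hypothetical, and the probe IDs are used merely as example strings.

```python
from collections import Counter


def frequent_markers(feature_sets, min_count=5):
    """Return the genes that occur in at least min_count of the feature sets."""
    # set(fs) ensures a gene is counted at most once per run.
    counts = Counter(gene for fs in feature_sets for gene in set(fs))
    return sorted(gene for gene, c in counts.items() if c >= min_count)


# Ten hypothetical runs: one gene is selected every time, one in six runs,
# and one in only four runs.
runs = [["37639_at", "32243_g_at"]] * 6 + [["37639_at", "40435_at"]] * 4
markers = frequent_markers(runs, min_count=5)
```

Here only the two genes that reach the threshold are kept; the gene selected in four runs is discarded.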

Conclusion

In this study, the problem of supervised feature selection is posed as the identification of relevant and non-redundant gene markers from microarray gene expression data. The microarray data matrix is converted into a feature-dissimilarity graph in which nodes stand for features. The nodes and edges are weighted according to feature relevance and the dissimilarity between features, respectively. The densest subgraph, having maximum average node and edge weight, is then identified, which means that features with high relevance and low redundancy are selected as output. To identify a subgraph of non-redundant and relevant feature nodes, a graph-based multiobjective bPSO has been proposed. Here, bPSO has been modeled using a multiobjective framework based on non-dominated sorting and crowding distance sorting. Three real-life datasets have been used for performance analysis. A comparative study between the proposed technique and its single-objective versions, T-test and Ranksum test has been performed. Moreover, gene marker analysis with respect to each dataset is also illustrated. As future work, we plan to incorporate a supervised wrapper-based approach to calculate objective functions using fuzzy association rules.

Author Contributions

Conceived and designed the experiments: AM MM. Performed the experiments: MM. Analyzed the data: MM. Contributed reagents/materials/analysis tools: AM. Wrote the paper: MM. Framework design: AM MM.

References

1. Kohavi R, John G (1997) Wrapper for feature subset selection. Artificial Intelligence 97: 273-324.

2. Ruiza R, Riquelmea J, Aguilar-Ruizb J (2006) Incremental wrapper-based gene selection from microarray data for cancer classification. Pattern Recognition 39: 2383-2392.

3. Mitra P, Murthy C, Pal S (2002) Unsupervised feature selection using feature similarity. IEEE Transaction on Pattern Analysis and Machine Intelligence 24: 301-312.

4. Jiang S, Wang L (2012) An unsupervised feature selection framework based on clustering. In: New Frontiers in Applied Data Mining.

5. Cai D, Zhang C, He X (2010) Unsupervised feature selection for multi-cluster data. In: KDD'10, Washington DC, USA.

6. Dy J, Brodley C, Kak A, Broderick L, Aisen A (2003) Unsupervised feature selection applied to content-based retrieval of lung images. IEEE Transaction on Pattern Analysis and Machine Intelligence 25: 373-378.

7. Morita M, Oliveira L, Sabourin R (2004) Unsupervised feature selection for ensemble of classifiers. In: Frontiers in Handwriting Recognition.

8. Zhang Z, Hancock E (2011) A graph-based approach to feature selection. Springer.

9. Bahmani B, Kumar R, Vassilvitskii S (2012) Densest subgraph in streaming and mapreduce. VLDB Endowment 5: 454-465.

10. Li Y, Lu B, Wu Z (2006) A hybrid method of unsupervised feature selection based on ranking. In: IEEE Computer Society, Washington DC, USA.

11. Liu Y, Wang G, Chen H, Dong H, Zhu X, et al. (2011) An improved particle swarm optimization for feature selection. Journal of Bionic Engineering 97: 191-200.

12. Tang E, Suganthan P, Yao X (2005) Feature selection for microarray data using least squares svm and particle swarm optimization. In: IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology.

13. Chen LF, Su CT, Chen KH (2012) An improved particle swarm optimization for feature selection. Intelligent Data Analysis 16: 167-182.

14. Mohamad M, Omatu S, Deris S, Yoshioka M, Abdullah A, et al. (2013) An enhancement of binary particle swarm optimization for gene selection in classifying cancer classes. Algorithms for Molecular Biology 8.

15. Xue B, Cervante L, Shang L, Browne W, Zhang M (2012) A multi-objective particle swarm optimisation for filter-based feature selection in classification problems. Connect Sci 24: 91-116.

16. Lashkargir M, Monadjemi S, Dastjerdi A (2009) A hybrid multi-objective particle swarm optimization method to discover biclusters in microarray data. International Journal of Computer Science and Information Security 4.

17. Xue B, Zhang M, Browne W (2013) Particle swarm optimization for feature selection in classification: A multi-objective approach. IEEE Transaction On Cybernetics 43: 1656-1671.

18. Deb K (2001) Multi-objective Optimization Using Evolutionary Algorithms. England: John Wiley and Sons.



19. Coello CC (2002) Evolutionary multiobjective optimization: a historical view of the field. IEEE Computational Intelligence Magazine 1: 28-36.

20. Chuang L, Hsiao C, Yang C (2011) An improved binary particle swarm optimization with complementary distribution strategy for feature selection. In: International Conference on Machine Learning and Computing.

21. Cheok M, Yang W, Pui C, Downing J, Cheng C, et al. (2003) Characterization of pareto dominance. Operations Research Letters 31.

22. Maulik U, Mukhopadhyay A, Bandyopadhyay S (2009) Combining pareto-optimal clusters using supervised learning for identifying co-expressed genes. BMC Bioinformatics 10.

23. Yoon Y, Lee J, Park S, Bien S, Chung H, et al. (2008) Direct integration of microarrays for selecting informative genes and phenotype classification. Pattern Recognition 178: 88-105.

24. Alon U, Barkai N, Notterman D, Gish K, Ybarra S, et al. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 96: 6745-6750.

25. Gordon G, Jensen R, Hsiao L, Gullans S, Blumenstock J, et al. (2002) Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res 62: 4963-4967.

26. Jaeger J, Sengupta R, Ruzzo W (2003) Improved gene selection for classification of microarrays. In: Pac Symp Biocomput.

27. Hanczar B, Courtine M, Benis A, Hennegar C, Clement K, et al. (2003) Improving classification of microarray data using prototype-based feature selection. In: SIGKDD Explor Newslett.

28. M-Cedeno A, Q-Dominguez J, C-Januchs M, Andina D (2010) Feature selection using sequential forward selection and classification applying artificial metaplasticity neural network. In: Proc of the IEEE Industrial Electronics Society.

29. Mao K (2004) Orthogonal forward selection and backward elimination algorithms for feature subset selection. IEEE Transactions on Systems, Man and Cybernetics-Part B: Cybernetics 34: 629-634.

30. Hall M, Smith L (1999) Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper. In: Proc. of the 12th International FLAIRS Conference.

31. Mankiewicz R (2000) The Story of Mathematics. Princeton University Press.

32. Troyanskaya O, Garber M, Brown P, Botstein D, Altman R (2002) Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics 18: 1454-1461.

33. Ding C, Peng H (2005) Minimum redundancy feature selection for microarray gene expression data. Journal of Bioinformatics and Computational Biology 3: 185-205.

34. Kamandar M, Ghassemian H (2011) Maximum relevance, minimum redundancy band selection for hyperspectral images. In: 19th Iranian Conference on Electrical Engineering (ICEE).

35. Cover T, Thomas J (2006) Entropy, relative entropy and mutual information. Elements of Information Theory. John Wiley & Sons.

36. Kamandar M, Ghassemian H (2009) A cluster-based feature selection approach. In: International Conference on Hybrid Artificial Intelligence Systems.

37. Kamandar M, Ghassemian H (2011) A graph-based approach to feature selection. In: International Workshop on Graph-Based Representations in Pattern Recognition.

38. Eisen M, Spellman P, Brown P, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc National Academy of Sciences 95: 14863-14867.

39. Krause E Taxicab geometry. Addison-Wesley Innovative Series. Addison-Wesley Pub Co.

40. Baya A, Larese M, Granitto P, Gomez J, Tapia E (2007) Gene set enrichment analysis using non-parametric scores. Springer-Verlag Berlin Heidelberg.

41. Parsopoulos K (2010) Particle swarm optimization and intelligence: Advances and applications. Information Science Reference, Hershey, New York.

42. Unler A, Murat A (2010) A discrete particle swarm optimization method for feature selection in binary classification problems. Pattern Recognition 206: 528-539.

43. Deb K, Pratap A, Agrawal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6: 182-197.

44. Sierra M, Coello CC (2006) Multi-objective particle swarm optimizers: A survey of the state-of-the-art. International Journal of Computational Intelligence Research 2: 287-308.

45. Lee I, Lushington G, Visvanathan M (2011) A filter-based feature selection approach for identifying potential biomarkers for lung cancer. Journal of Clinical Bioinformatics 1.

46. Wang X, Gotoh O (2009) Cancer classification using single genes. In: International Conference on Genome Informatics.

47. Fukuta K, Okada Y (2012) Informative gene discovery in dna microarray data using statistical approach. In: Proc of the Intelligent Control and Innovative Computing.

48. Shipp M, Ross K, Tamayo P, Weng A, Kutok J, et al. (2002) Diffuse large b-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine 8.

49. Cheok M, Yang W, Pui C, Downing J, Cheng C, et al. (2003) Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells. Nature Genetics 34: 85-90.



